
In the following, we prove that the Hessian matrix of the loss function is directly related to the expectation of the covariance of the gradient. Taking the loss function as the negative logarithm of the likelihood, let $X$ be a set of input data for the network and $p(X; \hat{w}, \hat{\alpha})$ be the predicted distribution on $X$ under the network parameters $\hat{w}$ and $\hat{\alpha}$, i.e., the output logits of the head layer.

By omitting $\hat{w}$ for simplicity, the Fisher information of the set of probability distributions $P = \{p_n(X; \hat{\alpha}),\ n = 1, \ldots, N\}$ can be described by a matrix whose entry in the $i$-th row and $j$-th column is

$$
I_{i,j}(\hat{\alpha}) = \mathbb{E}_X\!\left[\frac{\partial \log p_n(X; \hat{\alpha})}{\partial \hat{\alpha}_i}\,\frac{\partial \log p_n(X; \hat{\alpha})}{\partial \hat{\alpha}_j}\right]. \tag{4.36}
$$

Recall that $N$ denotes the number of classes described in Eq. 4.21. It is then trivial to prove that the Fisher information of the probability distribution set $P$ approaches a scaled version of the Hessian of the log-likelihood as

$$
I_{i,j}(\hat{\alpha}) = -\mathbb{E}_X\!\left[\frac{\partial^2 \log p_n(X; \hat{\alpha})}{\partial \hat{\alpha}_i\, \partial \hat{\alpha}_j}\right]. \tag{4.37}
$$
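The identity in Eq. 4.37 is easy to verify numerically. The following toy sketch (a hypothetical example, not part of the DCP-NAS code) treats $\hat{\alpha}$ as the logits of a categorical distribution over $N$ classes and checks that the expected outer product of gradients of the log-likelihood equals the negative expected Hessian.

```python
import torch
from torch.autograd.functional import hessian

torch.manual_seed(0)

# Hypothetical toy setting: alpha plays the role of the architecture
# parameters and parameterizes a categorical distribution over N classes.
N = 4
alpha = torch.randn(N, requires_grad=True)

def log_p(a, n):
    """Log-probability of class n under softmax(a)."""
    return torch.log_softmax(a, dim=0)[n]

probs = torch.softmax(alpha, dim=0).detach()

fisher = torch.zeros(N, N)           # E_n[ grad log p_n  grad log p_n^T ]
neg_exp_hessian = torch.zeros(N, N)  # -E_n[ d^2 log p_n / (d a_i d a_j) ]
for n in range(N):
    (g,) = torch.autograd.grad(log_p(alpha, n), alpha)
    fisher += probs[n] * torch.outer(g, g)
    h = hessian(lambda a: log_p(a, n), alpha.detach())
    neg_exp_hessian -= probs[n] * h

print(torch.allclose(fisher, neg_exp_hessian, atol=1e-5))  # True
```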

Let $H_{i,j}$ denote the second-order partial derivative operator $\frac{\partial^2}{\partial \hat{\alpha}_i\, \partial \hat{\alpha}_j}$. Note that the first derivative of the log-likelihood is

$$
\frac{\partial \log p_n(X; \hat{\alpha})}{\partial \hat{\alpha}_i} = \frac{\partial p_n(X; \hat{\alpha})}{p_n(X; \hat{\alpha})\, \partial \hat{\alpha}_i}. \tag{4.38}
$$

The second derivative is

$$
H_{i,j}\, \log p_n(X; \hat{\alpha}) = \frac{H_{i,j}\, p_n(X; \hat{\alpha})}{p_n(X; \hat{\alpha})} - \frac{\partial p_n(X; \hat{\alpha})}{p_n(X; \hat{\alpha})\, \partial \hat{\alpha}_i} \cdot \frac{\partial p_n(X; \hat{\alpha})}{p_n(X; \hat{\alpha})\, \partial \hat{\alpha}_j}. \tag{4.39}
$$

Considering that

$$
\mathbb{E}_X\!\left(\frac{H_{i,j}\, p_n(X; \hat{\alpha})}{p_n(X; \hat{\alpha})}\right) = \int \frac{H_{i,j}\, p_n(X; \hat{\alpha})}{p_n(X; \hat{\alpha})}\, p_n(X; \hat{\alpha})\, dX = H_{i,j}\!\int p_n(X; \hat{\alpha})\, dX = 0, \tag{4.40}
$$

we take the expectation of the second derivative and then obtain the following.

$$
\begin{aligned}
\mathbb{E}_X\big(H_{i,j}\, \log p_n(X; \hat{\alpha})\big)
&= -\mathbb{E}_X\!\left\{\frac{\partial p_n(X; \hat{\alpha})}{p_n(X; \hat{\alpha})\, \partial \hat{\alpha}_i} \cdot \frac{\partial p_n(X; \hat{\alpha})}{p_n(X; \hat{\alpha})\, \partial \hat{\alpha}_j}\right\} \\
&= -\mathbb{E}_X\!\left\{\frac{\partial \log p_n(X; \hat{\alpha})}{\partial \hat{\alpha}_i} \cdot \frac{\partial \log p_n(X; \hat{\alpha})}{\partial \hat{\alpha}_j}\right\}.
\end{aligned}
\tag{4.41}
$$

Thus, an equivalent substitution for the Hessian matrix $\tilde{H}_{f_b}(\hat{\alpha})$ in Eq. 4.32 is the expectation of the product of two first-order derivatives of the log-likelihood. This concludes the proof that we can use the covariance of gradients to represent the Hessian matrix for efficient computation.
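In implementation terms, the Hessian with respect to the architecture parameters therefore never needs to be formed from second derivatives; it can be approximated by accumulating outer products of per-sample gradients of the log-likelihood. Below is a minimal PyTorch-style sketch of that substitution; `model`, `arch_params`, `data`, and `targets` are placeholder names, and the sketch is not taken from the DCP-NAS implementation.

```python
import torch
import torch.nn.functional as F

def empirical_fisher(model, arch_params, data, targets):
    """Approximate the Hessian w.r.t. the architecture parameters by the
    covariance of gradients (Eq. 4.41): the average outer product of
    per-sample gradients of the log-likelihood.

    Assumes `model(x)` returns class logits that depend on `arch_params`
    (a single tensor of architecture parameters).
    """
    d = arch_params.numel()
    fisher = torch.zeros(d, d, device=arch_params.device)
    for x, y in zip(data, targets):
        log_p = F.log_softmax(model(x.unsqueeze(0)), dim=-1)
        nll = F.nll_loss(log_p, y.unsqueeze(0))       # per-sample -log p
        (g,) = torch.autograd.grad(nll, arch_params)  # first derivatives only
        g = g.reshape(-1)
        fisher += torch.outer(g, g)
    return fisher / max(len(data), 1)
```

A full $d \times d$ matrix of this kind is only affordable when the number of architecture parameters is small; a common simplification is to keep only its diagonal, i.e., the per-parameter second moment of the gradients.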

4.4.6 Decoupled Optimization for Training the DCP-NAS

In this section, we first describe the coupling relationship between the weights and the architecture parameters in the DCP-NAS. Then we present the decoupled optimization applied during backpropagation of the sampled supernet to fully and effectively optimize these two coupled sets of parameters.

Coupled models for DCP-NAS
Combining Eq. 4.27 and Eq. 4.28, we first show how the parameters in DCP-NAS are formulated in a coupling relationship as